[SPARK-16071][SQL] Checks size limit when doubling the array size in BufferHolder#13829
[SPARK-16071][SQL] Checks size limit when doubling the array size in BufferHolder#13829clockfly wants to merge 4 commits intoapache:masterfrom
Conversation
|
Test build #60984 has finished for PR 13829 at commit
|
| int bitsetWidthInBytes = UnsafeRow.calculateBitSetWidthInBytes(row.numFields()); | ||
| if (row.numFields() > (Integer.MAX_VALUE - initialSize) / 8) { | ||
| throw new UnsupportedOperationException( | ||
| "Cannot create BufferHolder from input UnsafeRow because it is too big."); |
There was a problem hiding this comment.
...too big might be a bit to vague.... Can you use something like ...exceeds the maximum number of variables (268435455).
There was a problem hiding this comment.
...BufferHolder from input UnsafeRow... -> ...BufferHolder for input UnsafeRow...
We only get the numFields from the unsafe row and allocate memory for it.
|
One small comment. LGTM otherwise. |
|
Test build #60993 has finished for PR 13829 at commit
|
| * Grows the buffer by at least neededSize and points the row to the buffer. | ||
| */ | ||
| public void grow(int neededSize) { | ||
| if (neededSize > Integer.MAX_VALUE / 2 - totalSize()) { |
There was a problem hiding this comment.
shall we move this check into the if branch below? then we can just check length * 2 <= Integer.MAX_VALUE and others can understand it very easily as there is a final byte[] tmp = new byte[length * 2]; next line.
There was a problem hiding this comment.
final int length = totalSize() + neededSize;, this can cause integer overflow, as well as length * 2
| public BufferHolder(UnsafeRow row, int initialSize) { | ||
| this.fixedSize = UnsafeRow.calculateBitSetWidthInBytes(row.numFields()) + 8 * row.numFields(); | ||
| int bitsetWidthInBytes = UnsafeRow.calculateBitSetWidthInBytes(row.numFields()); | ||
| if (row.numFields() > (Integer.MAX_VALUE - initialSize - bitsetWidthInBytes) / 8) { |
There was a problem hiding this comment.
I don't quite understand this, we are trying to avoid overflow of this.fixedSize = UnsafeRow.calculateBitSetWidthInBytes(row.numFields()) + 8 * row.numFields(); right? Why we - initialSize here?
There was a problem hiding this comment.
this.buffer = new byte[fixedSize + initialSize];
|
Test build #61428 has finished for PR 13829 at commit
|
There was a problem hiding this comment.
One more thought: Can we grow the buffer to Integer.MAX_VALUE if we can't double its size? Then we have another chance to continue the execution and finish it.
There was a problem hiding this comment.
Currently the limit for neededSize + totalSize is Integer.MAX_VALUE / 2, I don't see there is a big difference to enlarge the limit to Integer.MAX_VALUE.
Integer.MAX_VALUE / 2 is about 1 GB, it is quite rare for a single row to exceed this limit.
There was a problem hiding this comment.
I think it's good to try our best to finish user's job instead of failing it. And it's not a lot of work, should be worth it, just grow the buffer to Integer.MAX_VALUE when neededSize + totalSize is between Integer.MAX_VALUE / 2 + 1 and Integer.MAX_VALUE
|
Test build #61452 has finished for PR 13829 at commit
|
3a831e0 to
4265771
Compare
There was a problem hiding this comment.
nit: Integer.MAX_VALUE /2 -> Integer.MAX_VALUE / 2, you missed a space...
|
LGTM except some style comments |
|
Test build #61514 has finished for PR 13829 at commit
|
|
Test build #61515 has finished for PR 13829 at commit
|
|
Test build #61517 has finished for PR 13829 at commit
|
…BufferHolder ## What changes were proposed in this pull request? This PR Checks the size limit when doubling the array size in BufferHolder to avoid integer overflow. ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13829 from clockfly/SPARK-16071_2. (cherry picked from commit 5320adc) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
thanks, merging to master/2.0! |
| var e = intercept[UnsupportedOperationException] { | ||
| new BufferHolder(new UnsafeRow(Int.MaxValue / 8)) | ||
| } | ||
| assert(e.getMessage.contains("too many fields")) |
There was a problem hiding this comment.
Should this string be defined in BufferHolder and referenced here so that the test wouldn't break if the exception message is modified ?
What changes were proposed in this pull request?
This PR Checks the size limit when doubling the array size in BufferHolder to avoid integer overflow.
How was this patch tested?
Manual test.